Introduction

I have always been interested in the statistics and sociology of crime. Questions like ‘Is there a connection between crime and poverty?’, ‘Does level of education correlate with crime rate?’ or ‘Are the factors correlated with property and personal crimes different or not?’ have held my attention for a long time.

This paper is an attempt to find which demographic factors correlate with the level of crime, using techniques from Udacity’s course ‘Exploratory data analysis with R’. I’ll try to select demographic factors that might correlate with crime rate and cover the following broad topics:

Research question

My starting point was to figure out how to get as granular a dataset as possible, both for crime and for demographic data. As it turns out, several major cities in the US publish a dataset of every reported crime, with a description of the crime and, even more importantly, its geographical location. I obtained these datasets for two large US cities, Los Angeles and Chicago, for the year 2013.

The primary source of various demographic data for the United States is the United States Census Bureau. It provides demographic data at various geographic levels through its site. The smallest geographical unit for which the bureau publishes sample data is the census block group. Additionally, the bureau provides the geographical shapes of these block groups. This allows us to determine the block group of every reported crime, since our initial crime dataset contains the geographical coordinates of each crime.

The next step was to determine which demographic factors to select for this paper. There is plenty of research on factors influencing crime. Most of it cites these demographic parameters as the ones with the most influence on crime rate:
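
The idea of matching a crime's coordinates to a block group shape can be sketched as a point-in-polygon test. This is only an illustration in plain Python, not the pipeline used for this paper (which relied on a spatial database); the function names, the block group ids and the square shapes below are all made up, and real block group boundaries are far more complex.

```python
from typing import Optional, Sequence

Point = tuple[float, float]  # (lon, lat)


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray casting: count the polygon edges crossed by a ray going east."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_block_group(
    point: Point, block_groups: dict[str, Sequence[Point]]
) -> Optional[str]:
    """Return the id of the first shape containing the point, or None."""
    for block_group_id, polygon in block_groups.items():
        if point_in_polygon(point, polygon):
            return block_group_id
    return None


# Hypothetical shapes: two adjacent unit squares standing in for block groups.
BLOCK_GROUPS = {
    "bg_west": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "bg_east": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}

print(find_block_group((0.5, 0.5), BLOCK_GROUPS))
print(find_block_group((1.5, 0.5), BLOCK_GROUPS))
```

A linear scan over all shapes is fine for a sketch; with hundreds of thousands of reports a spatial index is the more practical choice.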

  • Level of urbanization
  • Median income
  • Unemployment
  • Education level
  • Concentration of youth
  • Family conditions, especially with respect to divorce and family composition

Within the current paper I decided to focus on the following variables:

  • Population density,
  • Median personal income,
  • Share of unemployed,
  • Share of those with an education level of at least one year of college.

All this data is available for download at the census block group level through census.gov.

Constructing the research dataset

Creating a clean dataset for this paper was a long and time-consuming task. It consisted of multiple steps involving multiple technologies: Excel, GIS, R and SQL. I used SQL for loading and merging data, and PostGIS, a geographical extension for the Postgres database, for geographical calculations. I also used Excel and R for cleaning up the downloaded datasets. The process is described in detail in a separate document.

As part of cleaning up the dataset, I had to manually assign whether each crime was a personal or a property crime. The original crime datasets contained a “crime type” column, but its values were more granular. Moreover, each police department has its own set of crime types. In my case this column contained rather obvious types, such as ‘Arson’, ‘Assault’, ‘Homicide’ or ‘Theft’, as well as more obscure or granular types, such as ‘OTHER MISCELLANEOUS CRIME’ or ‘THEFT, COIN MACHINE’. I constructed a separate dataset that maps each crime type reported in the source datasets to one of the crime types used in this research: ‘personal’, ‘property’ or ‘other’.
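
A mapping of this kind can be sketched as a simple lookup table. The Python snippet below is only an illustration: the actual mapping dataset is not reproduced in this paper, so the category assigned to each type here is a guess, and the `categorize` helper is a hypothetical name.

```python
# Hypothetical lookup table: the category chosen for each type is an
# illustrative assumption, not the mapping actually used in this paper.
CRIME_CATEGORY = {
    "ARSON": "property",
    "ASSAULT": "personal",
    "HOMICIDE": "personal",
    "THEFT": "property",
    "THEFT, COIN MACHINE": "property",
    "OTHER MISCELLANEOUS CRIME": "other",
}


def categorize(reported_type: str) -> str:
    """Map a reported crime type to 'personal', 'property' or 'other'."""
    # Normalize case and whitespace so that 'Theft' and ' THEFT ' match.
    key = " ".join(reported_type.upper().split())
    # Unmapped types fall back to 'other'.
    return CRIME_CATEGORY.get(key, "other")


for reported in ["Assault", "THEFT, COIN MACHINE", "Jaywalking"]:
    print(reported, "->", categorize(reported))
```

Falling back to ‘other’ keeps the sketch short, but it also hides unmapped types; logging them instead would make mapping mistakes easier to spot.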

At the end I constructed two datasets about crime and demographic statistics in Los Angeles and Chicago:

  1. crime_reports.csv, containing crime reports for 2013 in Los Angeles and Chicago. Its columns are:
    1. crime: crime type as found in the report.
    2. lat and lon: latitude and longitude associated with the crime report.
    3. reported_at: date and hour at which the crime was reported to have happened.
    4. city: Los Angeles or Chicago.
    5. type: personal, property or other.
  2. crime_demo_data.csv, containing the number of personal, property and other crimes by census block group, with demographic variables. In addition to the census block group identifier and crime levels, the following demographic variables are present:
    1. Population density, persons per square kilometer.
    2. Median personal income, US dollars.
    3. Share of unemployed.
    4. Share of people with at least a year of college education.
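
The second dataset is essentially the first one counted by block group and crime type. A minimal Python sketch of that aggregation follows; it assumes each report has already been matched to a block group, and the `block_group` field and the sample ids are hypothetical (the `type` field follows the column described above).

```python
from collections import Counter, defaultdict


def count_by_block_group(reports):
    """Count reports per block group, split by crime type."""
    counts = defaultdict(Counter)
    for report in reports:
        counts[report["block_group"]][report["type"]] += 1
    return counts


# Made-up sample rows; 'block_group' is an assumed field name.
SAMPLE_REPORTS = [
    {"block_group": "bg_a", "type": "personal"},
    {"block_group": "bg_a", "type": "property"},
    {"block_group": "bg_a", "type": "personal"},
    {"block_group": "bg_b", "type": "other"},
]

for block_group, by_type in count_by_block_group(SAMPLE_REPORTS).items():
    print(block_group, dict(by_type), "total:", sum(by_type.values()))
```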

Analysis and exploration of data

In this section I will look at the two datasets in more detail and try to provide a deeper overview of the crime situation in both cities. Let’s start by exploring basic descriptive statistics of the crime reports dataset.

Descriptive analysis of crime dataset

There are 535,912 crime reports in our dataset: 304,372 (56.8%) were reported in Chicago and 231,540 (43.2%) were reported in Los Angeles. Crime rates (number of reported crimes per 100,000 residents) are given in the table below:

Crime rate in Chicago and Los Angeles, 2013
               personal   property     total
Chicago           4,169      5,834    11,248
Los Angeles       2,558      3,356     6,105

And this is the breakdown between crime types and cities in absolute numbers:

Reported crime in Chicago and Los Angeles, 2013
               personal   property     other
Chicago         112,830    157,875    33,667
Los Angeles      97,020    127,297     7,223

Chicago had more crime reported in 2013 than Los Angeles, both in absolute and especially in relative numbers. The gap in crime rates is large for both main crime types: Chicago’s rate of personal crimes is around 60% higher than Los Angeles’ (4,169 versus 2,558 per 100,000), and its rate of property crimes is around 74% higher (5,834 versus 3,356). Most of the crimes were property crimes in both cities.
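
The shares and relative differences quoted in this section can be rechecked directly from the two tables. The following Python sketch only uses the numbers printed above; the variable and function names are my own.

```python
# Counts and rates copied from the two tables above.
REPORTS = {
    "Chicago": {"personal": 112_830, "property": 157_875, "other": 33_667},
    "Los Angeles": {"personal": 97_020, "property": 127_297, "other": 7_223},
}
RATES_PER_100K = {
    "Chicago": {"personal": 4_169, "property": 5_834},
    "Los Angeles": {"personal": 2_558, "property": 3_356},
}


def percent_higher(a: float, b: float) -> float:
    """How much higher a is than b, in percent of b."""
    return 100.0 * (a - b) / b


city_totals = {city: sum(by_type.values()) for city, by_type in REPORTS.items()}
grand_total = sum(city_totals.values())

for city, total in city_totals.items():
    print(f"{city}: {total:,} reports, {100 * total / grand_total:.1f}% of all")

for crime_type in ("personal", "property"):
    gap = percent_higher(
        RATES_PER_100K["Chicago"][crime_type],
        RATES_PER_100K["Los Angeles"][crime_type],
    )
    print(f"{crime_type}: Chicago's rate is {gap:.0f}% higher")
```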

Let’s look at the time when crimes are happening. Hourly patterns by crime type and city are plotted below.

Share of crime reports by hour

Some patterns are the same across cities and crime types. For example, the lowest crime rate is during the early morning hours between 4 and 6 AM. It is also surprising that crime reports tend to be filed more at odd hours, as we see from the jagged lines in all the facets.

However, crime in Chicago and Los Angeles differs in several ways. As we saw in the table above, the rate of personal crime is higher in Chicago than in Los Angeles. Another difference is that property crimes tend to happen at different times in these two cities: the maximum share of reported property crimes in Chicago is at 9 AM, while in Los Angeles the maximum is at noon.
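
The hourly shares discussed here are simply the fraction of reports falling into each hour of the day. A minimal Python sketch of that computation is given below; the timestamp format is an assumption, since the exact format of the reported_at column is not stated in this paper, and the sample values are made up.

```python
from collections import Counter
from datetime import datetime


def hourly_share(timestamps, fmt="%Y-%m-%d %H:%M"):
    """Return {hour: share of reports} for hours 0..23.

    `fmt` is an assumed timestamp format, not necessarily the real one.
    """
    hours = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    total = sum(hours.values())
    if total == 0:
        raise ValueError("no timestamps given")
    return {hour: hours[hour] / total for hour in range(24)}


# Made-up timestamps for illustration only.
sample = [
    "2013-01-01 09:00",
    "2013-01-02 09:30",
    "2013-01-03 12:15",
    "2013-01-04 04:45",
]
shares = hourly_share(sample)
peak_hour = max(shares, key=shares.get)
print("peak hour:", peak_hour, "share:", shares[peak_hour])
```

Computing the shares separately for each city and crime type gives the values behind each facet of the plot.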

Let’s look at weekday patterns.

Share of crime reports by weekday

Here we see slightly different crime patterns between the two cities. Most crimes in Chicago are reported at the beginning of the week, while in Los Angeles they tend to happen in the middle of the week.

Finally, let’s plot crime reports on a map of the respective cities.

Property and personal crimes in Chicago and Los Angeles

This is the most interesting plot so far. Both personal and property crimes in Los Angeles are concentrated in a single area around Skid Row and Downtown Los Angeles. In Chicago, though, the two types of crime are concentrated in completely different areas. Property crimes are clustered around the Chicago city center: Near North Side, Chicago Loop and River North. Personal crimes are concentrated heavily in the western areas of Chicago: North and South Lawndale and Near West Side.

Final plots and summary

Let’s start by plotting our demographic variables on a city map, to find visual clues about the connection between demography and crime level. This is a map of Los Angeles with the four demographic variables on it:

## OGR data source with driver: ESRI Shapefile 
## Source: "raw/tl_2013_06_bg", layer: "tl_2013_06_bg"
## with 23212 features
## It has 12 fields

Demography maps of Los Angeles

Same variables for Chicago look like this:

## OGR data source with driver: ESRI Shapefile 
## Source: "raw/tl_2013_17_bg/", layer: "tl_2013_17_bg"
## with 9691 features
## It has 12 fields

Demography maps of Chicago

There is a visible pattern linking median income and education level on the one hand with crime level on the other. In both cities, median income and education level are lower in the areas where the personal crime rate is higher. However, there is no clearly visible pattern for density or unemployment level.

Let’s move to our main research question, namely finding out whether there is any correlation between the four selected demographic variables and the level of crime.

Scatterplots of the number of crimes of each type against the demographic parameters, by block group, are plotted below:

Crime reports and demographic variables

Overall, the correlation between our demographic parameters and crime reports looks rather weak. The correlation between education level and the amount of reported crime looks weakest of all, while median income and unemployment level correlate visibly better.

Let’s look at the correlation matrix: how the number of crime reports by block group correlates with our four demographic parameters.

                                      Total crimes   Property crimes   Personal crimes
Population density                           0.136             0.167             0.096
Median income                               -0.162            -0.049            -0.245
Share of unemployed                          0.236             0.133             0.272
Share of at least a year in college         -0.137            -0.030            -0.224

The table above corroborates our previous visualizations: indeed, all of the demographic variables correlate only weakly with the number of crime reports. Let’s look at the correlations for each city separately.

Correlations between crime reports and demography, Los Angeles
                                      Total crimes   Property crimes   Personal crimes
Population density                           0.247             0.242             0.231
Median income                               -0.131            -0.089            -0.164
% of unemployed                              0.057             0.042             0.060
% of at least a year in college             -0.076            -0.024            -0.131

Correlations between crime reports and demography, Chicago
                                      Total crimes   Property crimes   Personal crimes
Population density                         0.18528           0.23793           0.06282
Median income                             -0.15135           0.04073          -0.32675
% of unemployed                            0.25443           0.09741           0.35156
% of at least a year in college           -0.17217           0.00041          -0.32294

And again, we see that the correlation between our four demographic variables and the number of reported crimes is weak in both cities. Chicago shows the higher coefficients of the two cities for income, unemployment and education, especially for personal crimes, but they are still too close to zero to indicate a strong relationship.
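
For reference, a correlation coefficient of the kind shown in these tables can be computed as follows. This Python sketch assumes the coefficients are Pearson correlations, which is not stated explicitly in this paper, and the block group values in the example are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up values for five block groups, for illustration only.
median_income = [21_000, 35_000, 48_000, 62_000, 90_000]
personal_crimes = [40, 35, 22, 25, 9]
print(round(pearson(median_income, personal_crimes), 3))
```

A coefficient near -1 or 1 means the points lie close to a straight line; values like those in the tables above mean the scatter around any such line is large.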

Reflection

In the end, my research failed to show a significant correlation between the level of reported crime in Chicago and Los Angeles and any of the four selected demographic characteristics.

While we clearly see that in both cities crimes are heavily concentrated in certain areas, the correlation coefficients between the number of reported crimes and the demographic characteristics at the census block group level were too close to zero.

So what could have gone wrong in my analysis? Why did I fail to uncover strong correlations between crime rate and median income, population density, unemployment and education level? After all, these four parameters are associated with crime rate extremely often in the relevant literature.

The first thing that could be wrong is the level of granularity of my geographical data. Census block groups, which are the smallest geographical unit for which census sample data is publicly provided, might be too small for the broad analysis attempted in this paper. This is especially visible in my plots of density across block groups: the values look randomly distributed. Plotting the same data on larger geographical units might give us a different picture.

A lot of effort was spent on normalizing and merging the raw crime datasets. One of the tasks was normalizing crime types, making them uniform across all four datasets. I could have made mistakes in mapping the crime types reported in the original datasets to the crime types used in this paper.

Besides the problems above, there are broader issues with the approach taken. For example, it has no temporal aspect: the dataset covers only the year 2013. This would be crucial had I attempted a cause-and-effect analysis.

However, despite my failure to find support for my research question, plenty of positive things can be found in this paper.

For example, I showed that there is a lot of potential value in combining publicly available datasets from different sources. Plenty of demographic data is available for geographical units at several levels from the American Community Survey and the census.

Further research using the datasets from this paper could include: